Abstract

We present a new fast online clustering algorithm that reliably
recovers arbitrary-shaped data clusters in high-throughput data
streams. Unlike the existing state-of-the-art online clustering
methods based on k-means or k-medoid, it does not make any
restrictive generative assumptions. In addition, in contrast to
existing nonparametric clustering techniques such as DBScan or
DenStream, it gives provable theoretical guarantees. To achieve fast
clustering, we propose to represent each cluster by a skeleton set
which is updated continuously as new data is seen. A skeleton set
consists of weighted samples from the data, where the weights encode
local densities. The size of each skeleton set is adapted according
to the cluster geometry. The proposed technique automatically detects
the number of clusters and is robust to outliers. The algorithm works
for infinite data streams, where more than one pass over the data is
not feasible. We provide theoretical guarantees on the quality of the
clustering and also demonstrate its advantage over the existing
state-of-the-art on several datasets.


1. Introduction 

Online clustering in massive data streams is becoming important as
data in a variety of fields, including social media, finance and web
applications, arrives as a high-throughput stream. In social networks,
detecting and tracking clusters or communities is important for
analyzing evolutionary patterns. Similarly, online clustering can
support spam and fraud detection in web applications, for example by
detecting unusual mass activities in email services or online reviews.


There exist several challenges in developing a good clustering
algorithm in a high-throughput online scenario. In real-world
applications, the number and shape of the clusters are typically
unknown. The existing state-of-the-art online clustering methods with
provable theoretical guarantees are primarily based on k-means or
k-median/medoid, which assume a priori knowledge of the number of
clusters and inherently make strong generative assumptions about the
clusters. These assumptions force the retrieved clusters to be convex,
leading to poor clustering for many real-world tasks. There exist
several nonparametric techniques that do not make simplistic
generative assumptions, but they are mostly based on heuristics and
lack theoretical guarantees. Moreover, in a true online scenario one
needs to deal with continuous streams, which precludes the multiple
passes over the data that many techniques rely on by assuming
finite-size streams. Potential drift in the data distribution over
time is another practical difficulty that needs to be handled
effectively. Finally, the clustering procedure should be efficient in
both space and time to be able to handle massive data streams.

In this paper we propose a novel Skeleton-based Online Clustering
(SOC) algorithm to address the above challenges. The basic idea of SOC
is to represent each cluster via a compact skeleton set which
faithfully captures the geometry of the cluster. Each skeleton set
maintains a small random sample from the corresponding cluster in an
online fashion, which is updated quickly using the new data points.
Each skeleton point is weighted according to the local density around
it. The number of skeleton points is automatically adapted to the
structure of the cluster in such a way that more complicated shapes
are approximated by more skeleton points. The skeleton sets are
updated by a random procedure which provides robustness in the
presence of outliers. The proposed algorithm automatically recovers
the correct number of clusters in the data with high probability as
more and more data is seen. The update strategy of the skeleton sets
also allows the clustering method to automatically adapt to any drift
in the data distribution. In SOC, clusters can be merged as well as
split over time. We also provide theoretical guarantees on the quality
of the clusters obtained from the proposed method.

1.1. Related work 

In comparison to the huge literature on offline clustering, work on
online clustering has been somewhat limited. Most of the existing
online clustering algorithms that have theoretical guarantees fall
under model-based techniques such as k-means, k-median or k-medoid
(Guha et al., 2003; Ailon et al., 2009; Shindler et al., 2011; Bagirov
et al., 2011). They assume a specific shape of the clusters, such as
spheres, which trivially leads to a compact representation using just
a few parameters, e.g., center, radius and the number of points.
However, as discussed before, these model-based algorithms fail to
capture arbitrary clusters in the data and can perform poorly.

There exist several nonparametric clustering methods where no
assumption is made about the cluster shapes. Popular among them are
DBScan (Ester et al., 1996), CluStream (Aggarwal et al., 2003), and
DenStream (Cao et al., 2006). Recent surveys have described several
variants of these algorithms (de Andrade Silva et al., 2013; Amini et
al., 2014). The DenStream and CluStream methods create microclusters
based on local densities, and combine them to form bigger clusters
over time. However, these methods need to periodically perform offline
partitioning of all the microclusters to form the clusters, which is
not suitable for online clustering of massive data streams. The
Leader-Follower algorithm is another popular method, and there exist
several variants of it (Duda et al., 2000; Shah & Zaman, 2010). These
techniques typically encode every cluster by one center which is
updated continuously as new points belonging to the cluster are
detected. Such a cluster representation is not rich enough to encode
more complex clusters. Overall, the main drawback of the above
nonparametric online clustering algorithms is that they are mostly
based on heuristics and lack any theoretical guarantees. They also
require extensive hand tuning of the parameters. In (Shah & Zaman,
2010), the authors assume each cluster to be a clique in order to
provide theoretical guarantees, which is very restrictive in
real-world settings.

Another popular method used in the context of incremental clustering
is the doubling algorithm (Charikar et al., 1997). Its standard
version encodes every cluster by just one point. Furthermore, even
though it allows clusters to be merged, it does not permit splitting
them. We implement a variant of the method in which several centers,
instead of one, are kept per cluster. As we will show in the
experimental section, this purely deterministic approach, despite
having some theoretical guarantees, is too sensitive to outliers.

Our proposed SOC algorithm shares a few similarities with two existing
techniques: the CURE algorithm (Guha et al., 2001) and core-sets
(Badoiu et al., 2002). In CURE, similar to SOC, each cluster is
represented by a random sample of data instead of just one center in
order to handle arbitrary cluster shapes. CURE, however, is an offline
hierarchical agglomerative clustering approach with running time
quadratic in the size of the data, which is too slow for online
applications. In core-set based clustering, the aim is to encode a
complicated cluster shape via a compact sample of points. The existing
state-of-the-art algorithms that use the idea of the core-set
(Gonzalez, 1985; Alon et al., 2000; Har-Peled & Varadarajan, 2001;
Badoiu et al., 2002) are computationally too intensive to be useful
for online clustering in practice. Their running time is exponential
in the number of stored skeleton points. Furthermore, the variants
that give provable theoretical guarantees are inherently offline
methods that often require several passes over the data to produce
good-quality clustering. For instance, the algorithm presented in
(Badoiu et al., 2002) needs to be rerun $2^{O(k)} \log(n)$ times,
where $k$ is the size of the core-set and $n$ is the dataset size.

Nonparametric graph-based techniques such as spectral clustering can
recover arbitrary-shaped clusters, but they are appropriate mainly for
the offline setting (Ng et al., 2001). Moreover, they also assume a
priori knowledge of the number of clusters. Several relaxations, such
as iterative biclustering, have been proposed to overcome the need to
know the number of clusters a priori, but these methods cannot be
extended to an online setting. Recently, there has been some work on
incremental spectral clustering, which essentially modifies the graph
Laplacian iteratively (Ning et al., 2010; Langone et al., 2014; Jia,
2012; Chia et al., 2009). In a true online setting, however, even
building a good initial graph Laplacian is infeasible, either due to
the lack of enough data or due to computational bottlenecks.

2. Clustering Framework 

In this work, since we are interested in retrieving arbitrary-shaped
clusters, it is important to first define what constitutes a good
cluster. The traditional model-based techniques assume a global
distance measure (e.g., the $L_2$ distance in k-means), which
restricts one to convex-shaped clusters. Instead, the proposed
algorithm works in a nonparametric setting where clusters are defined
by an intuitive notion of paths with "short edges". Two points are
more likely to belong to the same cluster if there exist many paths in
the data neighborhood graph such that any two consecutive points on
each path are not "far" from each other. Clearly, the overall distance
between two points can be large while they still belong to the same
cluster. Such a setting enables us to consider many complicated
cluster shapes. To emphasize, the neighborhood graph only serves to
explain the cluster definition we implicitly utilize in this work. We
do not need to explicitly construct a graph in our approach.
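The cluster notion above can be illustrated with a small offline sketch. It is not part of SOC (which never builds the graph) and it simplifies the definition: two points share a label as soon as a single path of edges no longer than a threshold connects them, whereas the text asks for many such paths. The function name `short_edge_clusters` and the threshold `eps` are assumptions introduced only for this illustration.

```python
import math
from itertools import combinations

def short_edge_clusters(points, eps):
    """Label points so that two points share a label iff some path whose
    consecutive points are at most eps apart connects them (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Explicit neighborhood graph: quadratic, for illustration only.
    for i, j in combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= eps:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

# A chain of points: the end points are 3 units apart, yet linked by short edges.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
labels = short_edge_clusters(chain, eps=1.0)
```

The first four points end up in one group although their overall extent exceeds `eps`, while the distant fifth point stays separate.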

2.1. Skeleton-based Online Clustering (SOC) Algorithm

The key idea behind the SOC algorithm is to represent each cluster via
a set of pseudorandom samples called the skeleton set. The algorithm
stores and constantly updates a set of skeleton sets
$\mathcal{P} = \{S_1, \ldots, S_t\}$. Note that the size of the set
$\mathcal{P}$ corresponds to the number of clusters, which may change
over time as new data is seen. Each skeleton set $S_i$ represents a
cluster and consists of a sample $\{v^i_1, \ldots, v^i_{h_i}\}$ of all
the points belonging to that cluster up to the current time, together
with some carefully chosen random numbers
$\{m^i_1, \ldots, m^i_{h_i}\}$ and weights
$\{w^i_1, \ldots, w^i_{h_i}\}$. Thus, a skeleton set $S_i$ is the set
of elements of the form $(v^i_j, m^i_j, w^i_j)$ for
$j = 1, 2, \ldots, h_i$.

Sometimes we will also call $\{v^i_1, \ldots, v^i_{h_i}\}$ a skeleton
set, which should be clear from the context. Let us define a map
$\xi(v^i_j) = m^i_j$. We denote by $w_y$ the weight of the skeleton
point $y$ and by $m_y$ the corresponding random number. Weights encode
the local density around a skeleton point. We denote by $W_i$ the sum
of the weights of all the skeleton points representing the $i$th
cluster and by $W(A)$ the sum of the weights of all the points
belonging to a set $A$. The skeleton set is updated in such a way that
at any given time the skeleton points are pseudouniformly distributed
over the entire cluster. As mentioned before, the number $h_i$ of
skeleton points of the $i$th cluster (alternative notation: $h_S$,
where $S$ stands for the skeleton set) is not fixed and can vary over
time. When a cluster arises, it is initialized with $h_{init}$
skeleton points. In the algorithm we take $h_{init} = 1$, and in the
theoretical section we show how lower bounds on $h_{init}$ can be
translated into strict provable guarantees regarding the quality of
the produced clustering. The algorithm tries to maintain a skeleton
set in such a way that between any two skeleton points there exists a
path of relatively short edges consisting entirely of other skeleton
points from the same skeleton set. If the cluster grows and the
average distance between skeleton points becomes too large, the number
of skeleton points is increased. This number never exceeds $H$, a
given upper bound that determines the memory the user is willing to
devote to encoding any single cluster.
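The notation above maps naturally onto a small data structure. The following is a minimal sketch under stated assumptions: the class names `SkeletonPoint` and `SkeletonSet`, the method names, and the default cap `H = 8` are all hypothetical and chosen only to mirror the symbols $(v, m, w)$, $W_i$, $W(A)$ and $H$ in the text.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SkeletonPoint:
    v: tuple        # sampled data point
    m: float        # random number attached to the sample
    w: float = 1.0  # weight; encodes the local density around v

@dataclass
class SkeletonSet:
    points: list = field(default_factory=list)
    H: int = 8      # hypothetical cap on the number of skeleton points

    def total_weight(self):
        """W_i: sum of the weights of all skeleton points."""
        return sum(p.w for p in self.points)

    def weight_of(self, x, r):
        """W(A) for A = skeleton points inside the ball B(x, r)."""
        return sum(p.w for p in self.points if math.dist(p.v, x) <= r)

S = SkeletonSet([SkeletonPoint((0.0, 0.0), 0.31, 2.0),
                 SkeletonPoint((5.0, 0.0), 0.74, 1.0)])
```

Storing the weight with each sample means a cluster of any size is summarized by at most `H` triples.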

The overall algorithm works by first initializing the number of
samples stored in each skeleton set. In this work, without loss of
generality, we assume that each skeleton set initially has only one
sample. We propose two variants of the algorithm: one where only lazy
cluster merging is performed (MergeOnlySOC), and another where merging
can be more aggressive since splitting clusters is also allowed
(MergeSplitSOC). As we will see in the experimental section, the
latter produces a good approximation of the ground-truth clustering in
fewer steps, but at the cost of the extra time needed to check for and
perform splits. If the MergeSplitSOC version is turned on, then the
algorithm keeps a set of undirected graphs $\mathcal{G}$. Each element
of $\mathcal{G}$ is associated with a different skeleton set and
encodes the topology of connections between points in that set. We
denote by $G_S$ the element of $\mathcal{G}$ associated with the
skeleton set $S$.

Algorithm 1: SOC clustering - main procedure

Input: Infinite data stream $\mathcal{D}$;
Output: Cluster assignment for the observed data;
Pick $r$, $\alpha$; $\mathcal{P} \leftarrow \emptyset$, $\mathcal{G} \leftarrow \emptyset$;
while true do
    for each $S_i \in \mathcal{P}$ do
        if exists $v \in S_i$ such that $w_v < \frac{W_i}{2 h_i}$ then
            CheckSplit($\mathcal{G}$, $v$, $r$, $S_i$, $\mathcal{P}$);
        end
    end
    Read next $x \in \mathcal{D}$;
    Compute $T_i = S_i \cap B(x, r)$ for each $S_i \in \mathcal{P}$;
    $U \leftarrow \emptyset$;
    for each $T_i$ do
        if $W(T_i) > \alpha W_i$ then
            $U \leftarrow U \cup \{S_i\}$;
        end
    end
    if $U \neq \emptyset$ then
        Merge($x$, $U$, $\mathcal{P}$, $\mathcal{G}$, $r$);
    else
        AddSingleton($x$, $\mathcal{P}$, $\mathcal{G}$);
    end
end

An overview of the SOC algorithm is given in Algorithm 1. If splitting
is turned on, at each time step $t$, given the existing skeleton set
for each cluster, the algorithm checks whether any cluster should be
split. Then, as a new data point $x$ arrives in the stream, a ball of
radius $r$ centered at $x$, i.e., $B(x, r)$, is first created. Next,
the intersection of this ball with each existing skeleton set is
computed. If the weight of the skeleton points in this intersection is
more than $\alpha W_i$ for $0 < \alpha < 1$, then $x$ is assumed to
belong to that cluster and the corresponding skeleton set is updated.
Note that multiple clusters may claim the point $x$, in which case all
those clusters are considered for merging. If no cluster claims $x$
(in particular, if the intersection of the ball with all the skeleton
sets is empty), then a new singleton cluster is created.
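The claiming rule of the main loop can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: skeleton sets are plain lists of `(v, m, w)` triples, and `claiming_clusters` is a hypothetical helper name. A cluster claims the new point when the skeleton weight inside the ball around it exceeds a fraction `alpha` of the cluster's total weight.

```python
import math

def claiming_clusters(x, skeleton_sets, r, alpha):
    """Return indices of skeleton sets S_i with W(S_i ∩ B(x, r)) > alpha * W_i."""
    claimed = []
    for idx, S in enumerate(skeleton_sets):
        W_i = sum(w for _, _, w in S)                           # total weight
        W_T = sum(w for v, _, w in S if math.dist(v, x) <= r)   # weight in ball
        if W_T > alpha * W_i:
            claimed.append(idx)
    return claimed

sets = [
    [((0.0, 0.0), 0.1, 3.0), ((1.0, 0.0), 0.2, 1.0)],  # cluster 0, W = 4
    [((10.0, 0.0), 0.3, 2.0)],                         # cluster 1, W = 2
]
near = claiming_clusters((0.2, 0.0), sets, r=0.5, alpha=0.5)  # one claimant
far = claiming_clusters((5.0, 0.0), sets, r=0.5, alpha=0.5)   # no claimant
```

An empty result corresponds to the AddSingleton branch, a non-empty one to the Merge branch.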

We start by describing the MergeOnlySOC variant (i.e., we assume that
the splitting-related procedures, CheckSplit in Algorithm 1 and
UpdatedGraph in Algorithm 2, are turned off). The MergeSplitSOC
variant will be discussed later in Section 2.2.

Algorithm 2: SOC clustering - Merge subprocedure

Input: Datapoint $x$, subset $U = \{S_{i_1}, \ldots, S_{i_k}\} \subseteq \mathcal{P}$,
    current clustering $\mathcal{P}$, family of graphs $\mathcal{G}$, radius $r$;
Output: Updated $\mathcal{P}$ after merging $x$ with clusters from $U$;
Denote: $W = \sum_{v \in (S_{i_1} \cup \ldots \cup S_{i_k}) \cap B(x, r)} w_v$,
    $d_{av}(x, r) = \sum_{y \in (S_{i_1} \cup \ldots \cup S_{i_k}) \cap B(x, r)} \frac{w_y}{W} \|x - y\|_2$,
    $h_{un} = \min(\sum_{j=1,\ldots,k} h_{S_{i_j}}, H)$;
Compute: $S^e_{i_j} = Ext(S_{i_j}, h_{un})$ for $j = 1, \ldots, k$.
Let: $\mathcal{S} = \{S^e_{i_1}, \ldots, S^e_{i_k}\}$;
if $d_{av}(x, r) \le \frac{r}{2}$ or ($h_{un} = H$) then
    $\mathcal{S} \leftarrow \mathcal{S} \cup \{Ext(\{x\}, h_{un})\}$;
end
Denote by $v^i_j$ the $j$th skeleton point of the $i$th skeleton set of $\mathcal{S}$;
initialize: $S_{merged} \leftarrow \emptyset$;
for $j = 1, \ldots, h_{un}$ do
    $v \leftarrow \arg\min(\xi(v^1_j), \ldots, \xi(v^{|\mathcal{S}|}_j))$;
    $m \leftarrow \min(\xi(v^1_j), \ldots, \xi(v^{|\mathcal{S}|}_j))$;
    $w \leftarrow w_v$;
    $S_{merged} \leftarrow S_{merged} \cup \{(v, m, w)\}$
end
if $d_{av}(x, r) > \frac{r}{2}$ and ($h_{un} < H$) then
    generate $\hat{r}$ according to $Gen(s)$;
    $S_{merged} \leftarrow S_{merged} \cup \{(x, \hat{r}, 1)\}$;
else
    Let $v_{min} \in S_{merged}$ be such that
        $\|x - v_{min}\|_2 = \min_{z \in S_{merged}} \|x - z\|_2$;
    if $v_{min} \neq x$ then
        $m_{v_{min}} \leftarrow \min(m_{v_{min}}, \hat{r})$, where $\hat{r}$ is generated
        according to $Gen(s)$;
    end
end
Let $\mathcal{G}_0 = \{G_{S_{i_1}}, \ldots, G_{S_{i_k}}\}$;
$\mathcal{P} \leftarrow (\mathcal{P} \cup \{S_{merged}\}) \setminus \{S_{i_1}, \ldots, S_{i_k}\}$;
UpdatedGraph($\mathcal{G}$, $x$, $r$, $\mathcal{G}_0$, $S_{merged}$);

2.1.1. Subprocedure Merge 

The goal of the Merge subprocedure is to merge a new point $x$ with
one or more clusters (Algorithm 2). When $x$ is assigned to multiple
clusters, it acts as a linking point that merges them into a unified
cluster. This step is crucial in the online clustering scenario, as
points from the same cluster may initially be assigned to different
clusters; as more evidence builds up from the new data, one can
combine these clusters to recover the true underlying cluster. In the
Merge subprocedure the skeleton set is updated when a new cluster is
constructed. The skeleton sets $S_{i_1}, \ldots, S_{i_k}$ representing
the clusters that will be merged are given as input. Before describing
the Merge procedure, let us introduce an important subroutine. Let
$S = \{(v_1, m_1, w_1), \ldots, (v_t, m_t, w_t)\}$ be a skeleton set
of size $h_S < h$ for some $h > 0$. We denote by $Ext(S, h)$ the
extension of $S$ obtained by adding to $S$ exactly $h - h_S$ more
triples according to the following procedure. Each newly added triple
has weight 1. Each newly added skeleton point is chosen independently
at random from the set $\{v_1, \ldots, v_t\}$ in such a way that
skeleton point $v_i$ is chosen with probability
$\frac{w_i}{\sum_j w_j}$. The corresponding random number $m$ is
generated by a pseudorandom number generator $Gen(s)$ with seed $s$.
The seed can be initialized randomly, or alternatively it can be
chosen as a function of the skeleton point by conceptually
partitioning the entire input space with a grid of cell length
$\delta$ and using the id of the cell occupied by $x$ in this grid.
The latter procedure is useful for infinite streams to avoid
correlated random sequences for points that are far apart in the
space. Subroutine $Ext$ can also be run if its first argument is a
single point $x$. In this case the output is a skeleton set of the
form $\{(x, r_1, 1), \ldots, (x, r_h, 1)\}$, where $r_1, \ldots, r_h$
is a sequence of $h$ random numbers generated according to $Gen(s)$.
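The extension subroutine described above can be sketched as follows. This is a minimal illustration under assumptions: skeleton sets are lists of `(v, m, w)` triples, the generator $Gen(s)$ is modeled by a seeded `random.Random` instance, sampling is taken to be proportional to the weights, and the function name `ext` is hypothetical.

```python
import random

def ext(S, h, rng):
    """Extend skeleton set S (list of (v, m, w) triples) to exactly h triples.

    Extra points are drawn independently from the existing skeleton points
    with probability proportional to their weights; each new triple gets a
    fresh random number and weight 1.
    """
    if len(S) > h:
        raise ValueError("skeleton set is already larger than h")
    points = [v for v, _, _ in S]
    weights = [w for _, _, w in S]
    extra = rng.choices(points, weights=weights, k=h - len(S))
    return list(S) + [(v, rng.random(), 1.0) for v in extra]

rng = random.Random(0)  # stands in for Gen(s) with seed s = 0
S = [((0.0,), 0.5, 3.0), ((1.0,), 0.2, 1.0)]
extended = ext(S, 5, rng)
```

The original triples are kept unchanged at the front, so the extension only pads the set with weighted resamples.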

The Merge algorithm computes an average weighted distance
$d_{av}(x, r)$ from the new point to those skeleton points from
$S_{i_1}, \ldots, S_{i_k}$ that reside in $B(x, r)$. Next, two cases
are considered. If $d_{av}(x, r) \le \frac{r}{2}$ or the number of
skeleton points in all the skeleton sets to be merged has reached $H$,
then the algorithm decides not to increase the size of the merged
skeleton set (since the linking point is relatively close to the
merged clusters, or the union of the skeleton sets under consideration
is already saturated). Denote by $h_{un}$ the minimum of $H$ and the
total number of skeleton points in all skeleton sets to be merged. The
merged skeleton set will be of size $h_{un}$. Let us describe how the
$j$th skeleton point of the newly formed cluster, for
$j = 1, \ldots, h_{un}$, is computed. First, each contributing
skeleton set is extended to size $h_{un}$ by weighted random sampling
from it (see procedure $Ext$ described above). This is also done for
$x$ (we treat $x$ as a skeleton set consisting of $h_{un}$ copies of
$x$).

We take the $j$th skeleton points from all the clusters to be merged
and from $x$, and choose the new point and the corresponding value as
shown in Algorithm 2, using the random sequence generated for $x$.
Each newly added point has to contribute to the weight distribution in
the cluster. If the point $x$ is not in the new skeleton set, then the
closest point $v_{min}$ from the skeleton set is found and its weight
is increased by one ($x$ contributes to the total weight of
$v_{min}$). The new skeleton set replaces the skeleton sets of all the
clusters that are merged.
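Two ingredients of the Merge step can be sketched compactly: the weighted average distance from the linking point to nearby skeleton points, and the position-wise selection of the triple with the smallest random number across the extended skeleton sets. The sketch assumes skeleton sets are lists of `(v, m, w)` triples that have already been extended to a common size; the helper names are hypothetical.

```python
import math

def weighted_avg_distance(x, skeleton_sets, r):
    """d_av(x, r): weight-averaged distance from x to skeleton points in B(x, r)."""
    near = [(v, w) for S in skeleton_sets for v, _, w in S
            if math.dist(v, x) <= r]
    W = sum(w for _, w in near)
    if W == 0:
        raise ValueError("no skeleton point inside the ball")
    return sum(w * math.dist(v, x) for v, w in near) / W

def minwise_merge(extended_sets):
    """For each position j keep the triple with the smallest random number m."""
    h = len(extended_sets[0])
    return [min((S[j] for S in extended_sets), key=lambda t: t[1])
            for j in range(h)]

A = [((0.0, 0.0), 0.9, 2.0), ((1.0, 0.0), 0.1, 1.0)]
B = [((2.0, 0.0), 0.4, 1.0), ((3.0, 0.0), 0.5, 1.0)]
d_av = weighted_avg_distance((0.5, 0.0), [A, B], r=1.0)
merged = minwise_merge([A, B])
```

Because the random numbers are independent of the geometry, taking position-wise minima yields a sample that is not biased toward any one of the merged clusters beyond the sizes of their extended sets.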

Now let us assume that $d_{av}(x, r) > \frac{r}{2}$ and the total
number of all skeleton points in the skeleton sets to be merged is
smaller than $H$. Intuitively speaking, this means that the cluster
has grown too much (the local density around the linking point $x$ is
too small) and thus the number of skeleton points encoding the cluster
has to increase (since the pool of skeleton points that can be used to
represent a cluster has not been used entirely). If this is the case,
then the same procedure as for the previous case is conducted, but the
skeleton set corresponding to $x$ is excluded. Finally, $x$ is added
to the newly formed cluster as the last skeleton point, with weight 1
and with the corresponding random number selected according to a given
random number generator.

Algorithm 3: SOC clustering - AddSingleton subprocedure

Input: Datapoint $x$, current clustering $\mathcal{P}$, family of graphs $\mathcal{G}$;
Output: Updated version of $\mathcal{P}$ after adding cluster-singleton $\{x\}$;
Generate a random number $m$ according to $Gen(s)$ and let $S_{new} = \{(x, m, 1)\}$;
$\mathcal{P} \leftarrow \mathcal{P} \cup \{S_{new}\}$;
$\mathcal{G} \leftarrow \mathcal{G} \cup \{G_{sin}\}$;

2.1.2. Subprocedure AddSingleton

The procedure AddSingleton adds a new cluster consisting of just $x$
when no existing cluster is found to be close enough to $x$ based on
the intersection with the skeleton sets (Algorithm 3). Next, a
skeleton set for this new singleton cluster is created. Since a
skeleton set aims to cover uniformly the entire mass of the cluster
using $h_{init}$ random samples, the point $x$ is repeated $h_{init}$
times to form the skeleton set for the cluster-singleton.

Furthermore, a sequence of $h_{init}$ random values is generated from
$Gen(s)$, and each copy of $x$ in the skeleton set is assigned one of
the values and weight $w = 1$ to build the $h_{init}$ triples and
complete the skeleton. In the proposed implementation we initialize
each newly created cluster-singleton with only one skeleton point,
thus $h_{init} = 1$ (see Algorithm 3). For the newly created skeleton
set an undirected graph-singleton $G_{sin}$ is created.

2.2. Splitting clusters 

We now describe the cluster splitting procedure in the 
MergeSplitSOC variant of the algorithm. It is handled by 
two additional procedures: CheckSplit and UpdatedGraph. 

CheckSplit determines whether a given skeleton set should be split by
looking for a candidate breaking point $v$, which is a skeleton point
whose weight is at most half of the average weight of the points
within the skeleton set. If such a point $v$ is found, then the
algorithm determines whether the cluster should be split as follows.
First, all the points from $B(v, \frac{r}{2})$ are deleted from the
corresponding graph $G_S$ of the skeleton set. A connected component
analysis is then conducted on the remaining graph. If more than one
connected component is found, then $v$ is a breaking point and the
cluster is split so that each connected component forms a new cluster
and the points in the connected components constitute the new skeleton
sets.

Algorithm 4: SOC clustering - CheckSplit subroutine

Input: Family of graphs $\mathcal{G}$, skeleton point $v$, radius $r$,
    skeleton set $S$ corresponding to $v$, current clustering $\mathcal{P}$;
Output: Updated version of $\mathcal{G}$ and $\mathcal{P}$;
Let $Rem = \{y \in S : \|v - y\|_2 \le \frac{r}{2}\}$;
Let $G'_S = G_S \setminus Rem$;
Run CC algorithm on $G'_S$ to obtain $\{C_1, \ldots, C_t\}$;
if $t > 1$ then
    Denote by $S_i$ the subset of $S$ corresponding to $C_i$ for
    $i = 1, \ldots, t$;
    Update: $\mathcal{G} \leftarrow (\mathcal{G} \cup \{C_1, \ldots, C_t\}) \setminus \{G_S\}$ and
    $\mathcal{P} \leftarrow (\mathcal{P} \cup \{S_1, \ldots, S_t\}) \setminus \{S\}$;
end

Algorithm 5: SOC clustering - UpdatedGraph subroutine

Input: Family of graphs $\mathcal{G}$, skeleton point $x$, radius $r$,
    subfamily $\{G_{S_1}, \ldots, G_{S_k}\} \subseteq \mathcal{G}$, skeleton set $S_{new}$;
Output: Updated version of $\mathcal{G}$;
Let $G = G_{S_1} \otimes \ldots \otimes G_{S_k}$;
Let $E = \{\{u, y\} : u, y \in S_{new} \cap B(x, r), \exists_{i \neq j}\ u \in G_{S_i}, y \in G_{S_j}\}$;
Let $G_{merged}$ be a graph obtained from $G$ by adding edges from $E$;
Update: $\mathcal{G} \leftarrow (\mathcal{G} \cup \{G_{merged}\}) \setminus \{G_{S_1}, \ldots, G_{S_k}\}$;
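The split test can be sketched as a plain connected-component check. The sketch makes several assumptions that are not taken from the paper: skeleton points are addressed by integer ids, the skeleton graph is given as a list of id pairs, the removal radius is passed in as a free parameter `rho`, and the function name `check_split` is hypothetical.

```python
import math

def check_split(points, edges, v, rho):
    """Remove skeleton points within rho of points[v], then look for components.

    points: dict id -> coordinates; edges: iterable of (id, id) pairs.
    Returns the list of components (sets of ids) if the remaining graph
    falls apart into more than one component, otherwise None.
    """
    removed = {p for p, c in points.items()
               if math.dist(c, points[v]) <= rho}
    keep = set(points) - removed
    adj = {p: set() for p in keep}
    for a, b in edges:
        if a in keep and b in keep:
            adj[a].add(b)
            adj[b].add(a)
    components, seen = [], set()
    for start in keep:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                      # depth-first traversal
            p = stack.pop()
            if p not in comp:
                comp.add(p)
                stack.extend(adj[p] - comp)
        seen |= comp
        components.append(comp)
    return components if len(components) > 1 else None

pts = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (3.0, 0.0), 4: (4.0, 0.0)}
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
split = check_split(pts, path, v=2, rho=0.5)     # removing the middle breaks the path
no_split = check_split(pts, path, v=0, rho=0.5)  # removing an end point does not
```

In the path example, deleting the neighborhood of the middle point leaves two components, which would become two new skeleton sets.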

The UpdatedGraph procedure is shown in Algorithm 5. It is responsible
for constructing a graph $G_{union}$ (denoted $G_{merged}$ in
Algorithm 5) for the newly formed cluster and for replacing with it
all the graphs corresponding to the merged skeleton sets. The graph
$G_{union}$ is constructed by combining all elements of $\mathcal{G}$
corresponding to the skeleton sets that need to be merged. Those
graphs are combined by adding edges between skeleton points from the
newly constructed skeleton set that are in the close neighborhood of
the linking point $x$. In the description of UpdatedGraph we denote by
$G = G_1 \otimes G_2 \otimes \ldots \otimes G_k$ the graph with vertex
set $V(G) = V(G_1) \cup \ldots \cup V(G_k)$ and edge set
$E(G) = E(G_1) \cup \ldots \cup E(G_k)$.

3. Theoretical analysis 

In this section we provide theoretical results regarding SOC 
algorithm for the clustering model described in Sec. 2. We 
start by introducing the general mathematical model we are 
about to analyze. It is one of the many variants of planted 
partition models used to construct data with hard clustering 
and outliers. Notice that our algorithm does not require the 
input to be produced according to this model. In particular, 
we do not use any specific parameters of the model in the 
algorithm. 

For a set of $D$-dimensional data, we assume it contains $k$







Fast Online Clustering with Randomized Skeleton Sets 


disjoint compact sets, which are called cores and denoted as
$\mathcal{R}_1, \ldots, \mathcal{R}_k \subset \mathbb{R}^D$. The cores
are called $\Delta$-separable if the minimum distance between any two
cores is greater than $\Delta$, i.e.:
$\forall_{1 \le i < j \le k,\ x \in \mathcal{R}_i,\ y \in \mathcal{R}_j}\ \|x - y\|_2 > \Delta$.
These cores can have arbitrary shapes, giving rise to the observed
clusters such that the points in the cluster $C_i$ come from core
$\mathcal{R}_i$ with high probability and from the rest of the space
with low probability. Formally, given a set of probabilities
$\{p_1, \ldots, p_k\}$, points in the cluster $C_i$ are sampled from
the core $\mathcal{R}_i$ with probability $p_i$ and from outside
$\mathcal{R}_i$ with probability $1 - p_i$, where
$p_i \gg (1 - p_i)$. It is important to note that even though the
cores are separable, this is no longer the case for the clusters due
to the presence of "outliers". In other words, short-edge paths
between points from different clusters may exist, but not many in
expectation. We call the clustering model presented above a
$(\mathcal{R}, p)$-model, where $p = (p_1, \ldots, p_k)$ and
$\mathcal{R} = (\mathcal{R}_1, \ldots, \mathcal{R}_k)$. This is a
quite general model in which the cores admit a good-quality clustering
(because of $\Delta$-separability). However, due to the outliers, the
task of recovering the clusters is nontrivial even in an offline
setting. Simple heuristics such as connected components cannot be used
to recover the clusters in the offline mode. The online setting brings
additional algorithmic and computational challenges. Below, we give
details of the proposed clustering mechanism.

We need the following definition. 

Definition 3.1. A set $W \subset \mathbb{R}^D$ is called
$(s, r)$-coverable if $W$ can be covered by $s$ balls, each of radius
$r$.

Let $\pi_i$ be the probability that a new point in the data stream
belongs to cluster $C_i$. Fix a covering $\mathcal{C}$ of
$\mathcal{R}_i$ with $|\mathcal{C}| \le s$ (for every core
$\mathcal{R}_i$). Let $B \in \mathcal{C}$ be an arbitrary ball from
the covering C. Furthermore, let p\ be a lower bound on 
the probability that a new point is from set {TZi H B) given 
it belongs to core TZi, which can be expressed as p^ ~ 1 . 

Denote Fi = AiAA- and 7 ^ = —, where = 

pf ’ ^ttd p{ = 1 —Pi. Here Fi is the probability that a 
new point came from a fixed ball of the covering $\mathcal{C}$ of
$\mathcal{R}_i$ given that it belongs to cluster $C_i$. Similarly,
$\gamma_i$ is the probability that a new point is an outlier given
that it belongs to cluster $C_i$. Since outliers are expected to be
rarer than points from the cluster cores, $\gamma_i \ll \Gamma_i$.
Denote $\Gamma = \min_i \Gamma_i$ and $\gamma = \max_i \gamma_i$.

Since we keep at most $H$ samples in a skeleton set, most of the
cluster points are not in this set. The error made by the algorithm on
a new point $v$ is defined as follows: the algorithm errs on $v$ if
$v$ comes from the core $\mathcal{R}_i$ and either gets assigned to a
cluster that contains points from other cores as well, or there exists
another cluster that also contains points from $\mathcal{R}_i$. Note
that this is a strict definition of error, since in an online setting
transient overclustering is expected due to the lack of enough data in
the early phase. We say that the algorithm reaches the saturation
phase when each skeleton set reaches size $H$. We are ready to state
the main results of our analysis regarding the MergeOnlySOC version of
the algorithm.

Figure 1. Merge scenario as in Lemma 3.1. Data point $v$ merges two
clusters: $P_1$ and $P_2$. The intersection of the ball around $v$
with $P_i$ for some $i \in \{1, 2\}$ must consist only of outliers
(points marked red). The other intersection may contain points from
the core (points marked green).

Theorem 3.1. Assume that we are given a dataset constructed according
to the $(\mathcal{R}, p)$-model with $k$ cores and with outliers, and
that the cores are $\Delta$-separable. Assume that each core of
the cluster is (s, ^)-coverable. Let n be the number of all 
the points seen by the algorithm after the saturation phase 
has been reached. Then with probability at least 1 — Aefor 
htmt = H(max(^ log(^), log(^))), r = f and 

a = j the SOC algorithm will not merge clusters contain¬ 
ing points from different cores in the saturation phase if 
they were not merged earlier. 

Theorem 3.1 gives an upper bound on the minimal number of skeleton
points per cluster that ensures that MergeOnlySOC does not
undercluster. As we will see in the experimental section, in practice
far fewer skeleton points suffice. We also have the following.

Theorem 3.2. Under the assumptions from Theorem 3.1, 
with probability at least 1 — e, the SOC algorithm will not 
make any errors on points coming from core TZi after m = 
5 ^ log(^) points from the corresponding cluster Ci have 
been seen in the saturation phase. 

Theorem 3.2 says that under a reasonable assumption regarding the
number of initial skeleton points per cluster, and after a short
initial subphase of the saturation phase, algorithm MergeOnlySOC
correctly classifies all points coming from cores. In other words, we
obtain an upper bound on the rate of convergence of the number of
clusters produced by MergeOnlySOC to the ground-truth value.

The proofs of Theorem 3.1 and Theorem 3.2 will be given 
in the Appendix. Below we give a very short introduction 
and present a useful lemma that we will rely on later. 







3.1. Merging Lemma 

Since both theorems consider the saturation phase of the algorithm,
whenever we talk about the algorithm in the theoretical analysis we in
fact mean its saturation phase. Without loss of generality, we can
also assume in our theoretical analysis that each skeleton point has
weight one. Indeed, a point $v$ that has weight $w_v > 0$ may be
equivalently treated as a collection of $w_v$ skeleton points of
weight one (see the description of the algorithm). Let us formulate
the following lemma:

Lemma 3.1. Let us assume that at time $t$ the algorithm merges two
clusters $P_1$ and $P_2$ such that $P_1$ contains a point from
$\mathcal{R}_i$ and $P_2$ contains a point from $\mathcal{R}_j$ for
some $i \neq j$. Then either at least $\alpha h$ of all skeleton
points of $P_1$ at time $t$ are outliers, or at least $\alpha h$ of
all skeleton points of $P_2$ at time $t$ are outliers.


Proof. The lemma follows from the definition of $\Delta$, according to
which any two points taken from different cores are at least $\Delta$
apart. If two clusters $P_1$, $P_2$ are merged at time $t$, then there
exists a data point $v$ (a merger) and a ball $B(v, r)$ such that
$B(v, r)$ contains at least $\alpha h$ skeleton points of $P_1$ and at
least $\alpha h$ skeleton points of $P_2$ (see Fig. 1). But $B(v, r)$
cannot contain points from different cores, since they are
$\Delta$-separable. Thus at least one of the clusters $P_1$, $P_2$ has
in $B(v, r)$ at least $\alpha h$ skeleton points that are outliers. □
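The separation step of the proof can be written out with the triangle inequality. The derivation below is a sketch under one explicit assumption: the merging ball has some radius $\rho$ with $\rho \le \Delta/2$ (the symbol $\rho$ is introduced here only for illustration).

```latex
\begin{align*}
x, y \in B(v, \rho)
  \;\Longrightarrow\;
  \|x - y\|_2 \;\le\; \|x - v\|_2 + \|v - y\|_2 \;\le\; 2\rho \;\le\; \Delta .
\end{align*}
```

Points taken from two different cores satisfy $\|x - y\|_2 > \Delta$, so under this assumption they cannot both lie in $B(v, \rho)$; every skeleton point in the ball that belongs to the "other" cluster must therefore be an outlier.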

4. Experiments 

We evaluated the performance of the proposed SOC algorithm 
using four synthetic datasets, shown in Fig. 2 (left 
column). Each dataset contains data points in 20 dimensions. 
The first two dimensions were randomly drawn from predefined 
clusters, as shown in the figure, while the other 18 
dimensions were random noise. For the datasets B1 and 
B2, 1000 data points were randomly drawn from each of 
the two banana-shaped clusters for the first two dimensions. 
Then 1000 (for B1) and 2000 (for B2) outliers were 
randomly generated in the vicinity of the shapes. 
For the datasets L1 and L2, 500 data points were 
randomly drawn from each of the four letter-shaped clusters. 
Then 500 (for L1) and 2500 (for L2) outliers were 
randomly generated in the vicinity of the shapes. 
The values in the other 18 dimensions for all the 
data were Gaussian white noise with a standard deviation 
of 0.01. All the data points were then randomly permuted 
so that their order in the data stream would not affect the 
results. The datasets are plotted in their first two 
dimensions in Fig. 2 (left column). We used the 
MergeSplitSOC version of the algorithm since it provided 
faster convergence of the number of clusters under the 
same quality guarantees. 
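
A stream with the same structure as B1 can be sketched as follows. This is only an illustration under stated assumptions: the exact banana shapes and the outlier region are not specified in the text, so the arcs, the jitter, the bounding box and the names `sample_arc` and `make_banana_stream` are hypothetical. The 18 noise dimensions, the noise standard deviation of 0.01 and the final random permutation follow the description above.

```python
import numpy as np

def sample_arc(rng, n, center, radius, start, stop, jitter=0.05):
    # Assumed stand-in for a banana-shaped cluster: a noisy circular arc.
    angles = rng.uniform(start, stop, size=n)
    pts = np.column_stack([np.cos(angles), np.sin(angles)]) * radius + center
    return pts + rng.normal(scale=jitter, size=(n, 2))

def make_banana_stream(n_per_cluster=1000, n_outliers=1000,
                       noise_dims=18, noise_std=0.01, seed=0):
    rng = np.random.default_rng(seed)
    upper = sample_arc(rng, n_per_cluster, (0.0, 0.0), 1.0, 0.0, np.pi)
    lower = sample_arc(rng, n_per_cluster, (1.0, 0.5), 1.0, np.pi, 2 * np.pi)
    # Assumed outlier model: uniform in a box around the two shapes.
    outliers = rng.uniform(low=(-1.5, -1.0), high=(2.5, 1.5),
                           size=(n_outliers, 2))
    planar = np.vstack([upper, lower, outliers])
    labels = np.concatenate([np.zeros(n_per_cluster, dtype=int),
                             np.ones(n_per_cluster, dtype=int),
                             np.full(n_outliers, -1)])  # -1 marks outliers
    # Remaining dimensions: Gaussian white noise.
    noise = rng.normal(scale=noise_std, size=(len(planar), noise_dims))
    data = np.hstack([planar, noise])
    # Permute so that arrival order carries no information.
    order = rng.permutation(len(data))
    return data[order], labels[order]
```

Calling `make_banana_stream(n_outliers=2000)` gives a B2-like stream with twice as many outliers.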


We compared the SOC method with several state-of-the-art 
nonparametric clustering methods, i.e., DBScan (Ester 
et al., 1996), the Leader-Follower algorithm (Duda et al., 
2000), the Doubling algorithm (Charikar et al., 1997), and 
DenStream (Cao et al., 2006). The clustering quality was 
quantitatively evaluated using the average clustering purity, 
which is defined as 

    purity = \frac{1}{K} \sum_{i=1}^{K} \frac{|C_i^d|}{|C_i|}    (1) 

where K is the number of clusters, |C_i| is the number of 
points in cluster i, and |C_i^d| is the number of points in 
cluster i with the dominant class label. 
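
The purity measure can be computed directly from cluster assignments and ground-truth class labels. The sketch below is a minimal illustration, assuming purity is reported as a fraction in [0, 1] and that the input is non-empty; the function name `average_purity` is hypothetical.

```python
from collections import Counter

def average_purity(cluster_ids, class_labels):
    """Average over clusters of the fraction of points that carry
    the cluster's dominant class label."""
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)
    fractions = []
    for labels in members.values():
        dominant_count = Counter(labels).most_common(1)[0][1]
        fractions.append(dominant_count / len(labels))
    return sum(fractions) / len(fractions)
```

For example, two clusters with dominant-label fractions 2/3 and 3/4 give an average purity of 17/24. Note that over-clustering does not lower this measure, since splitting a pure cluster produces pure clusters.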

Fig. 2 shows the comparative results of the SOC method as 
well as several other methods. Fig. 3 shows the clustering 
purity of all the methods. The SOC method requires two 
parameters (α and r). In all the experiments, we selected 
α = 0.03 and r = 0.07. All the results were produced 
using the best choice of parameters for each method. We 
note that parameter tuning was not trivial because most of 
the methods require at least two parameters. 

Results showed that the SOC method was able to cluster 
the data well, even though it slightly over-clustered the 
datasets B1 and B2. The Leader-Follower algorithm as well 
as streaming DBScan simply do not handle this type of 
data. SOC obtains results similar to those of the DenStream 
algorithm (it produced slight over-clustering, but obtained 
almost 100% purity). DBScan worked well on the banana 
sets, but it failed to cluster L2, where the outliers 
outnumbered the true cluster points. Doubling failed to work 
on the noisier datasets (B2 and L2). For the other two 
datasets, its clustering purity was low, probably because of 
the noise in the other 18 dimensions. The Leader-Follower 
method worked fine for L1 and L2, but poorly for B1 and B2. 
This was mostly because of the nature of the method, and 
partly because of the noise in the other 18 dimensions. The 
standard variant of Leader-Follower uses only a small number 
of centers per cluster, and when the clusters are not 
contained in convex, well-separable objects, the recognition 
is very poor. DenStream worked well in all cases. 

Our method is faster than DenStream (the running time 
per point varied from 60 to 90 microseconds for SOC and 
was above 100 microseconds for DenStream). The reason 
is that DenStream is not a purely online approach and 
periodically performs offline clustering, which is 
relatively expensive. DenStream has another serious 
drawback. The id of the cluster to which a newly arriving 
point is assigned is computed from the most recent snapshot 
of the offline clustering, not on the fly. Thus the accuracy 
of the method depends heavily on the special parameter 
determining how much the data distribution evolves over 
time. Since parameter tuning is always nontrivial for 
the density-based methods, this extra parameter adds 
another layer of difficulty. We do not need this parameter 
in our approach. 

Figure 2. Clustering results for the different methods 
(columns: DBScan, Doubling, Leader-Follower, DenStream, SOC). 
The left column shows four different synthetic datasets. The 
resulting clusters for each method are marked using different 
colors. Note that DenStream is a hybrid online-offline 
technique while SOC is a purely online method. 

Figure 3. The clustering purity for different methods on the 
four synthetic datasets. 

The SOC method slightly over-clustered B1 and B2 because 
of the online nature of the algorithm and the presence 
of the outliers. Nonetheless, it was able to correctly 
discard the outliers and produced results with high purity. 
This is because, even though outliers can become part of the 
skeleton set of a cluster, they are typically replaced 
eventually by true cluster points, since the true cluster 
points have a higher density and arrive at a higher rate 
than the outliers. 

Fig. 4 shows the skeleton points generated by SOC for two 
datasets. The maximum number of skeleton points used 
was only a few hundred. The number of skeleton 
points grows gradually as clusters become bigger. We set 
the upper bound on the number of skeleton points per 
cluster to H = 400, but this number was not reached. 

Figure 4. The skeleton points for the datasets (a) B1 and 
(b) L1. The data points are colored blue, and the skeleton 
points for different clusters are marked with different 
colors. 

5. Conclusions 

We have presented a new truly online clustering algorithm 
which can recover arbitrary-shaped clusters in the presence 
of outliers from massive data streams. Each cluster is 
represented efficiently by a skeleton set, which is 
continuously updated to dynamically adapt to the changing 
data distribution. The proposed technique is theoretically 
sound as well as fast and space-efficient in practice. It 
produced good-quality clusters in various experiments with 
nonconvex clusters. It outperforms several online approaches 
on many datasets and produces results similar to those of 
the most effective hybrid ones that combine online and 
offline steps (such as DenStream). In the future, we would 
like to investigate other methods for updating skeletons 
online within the given framework, since this mechanism is 
crucial for the effectiveness of the presented approach. 
Another interesting direction is to study the maximal number 
of clusters that are created during the execution of the 
algorithm. A more precise bound will provide a more accurate 
theoretical estimate of the memory usage. 

6. Appendix 

6.1. Proof of Theorem 3.1 

Proof. From Lemma 3.1 we conclude that in order to find 
an upper bound on the probability of the event that the 
algorithm will at some point merge "wrong clusters", it 
suffices to find an upper bound on the probability that the 
algorithm will at some point produce a cluster with at least 
αh skeleton points that are outliers and at least one 
non-outlier v. Denote this latter event by E. Denote by E_i 
the intersection of E with the event that v comes from the 
core R_i. Note that E = E_1 ∪ ... ∪ E_k. 

For a set of points D we say that a point v dominates D if 
the following is true: at least one of the h random numbers 
assigned to v by the algorithm is smaller than all 
corresponding random numbers assigned to data points from 
D \ {v}. Denote the set of all data points that are outliers 
by O. Let us fix a core R_i and some constant W. Denote 
by t_0, t_1, ... the time stamps at which the first W, 
W + 1, ... points from R_i ∪ O are collected. Let us first 
find a lower bound on the probability of the following event 
F: for every t_j, in every ball of the covering of R_i there 
are at least αh points from R_i that dominate the set of 
all the points collected up to time t_j. Fix some ball B of 
the covering of R_i. By the definition of Γ_i, we know that 
on average (W + j)Γ_i points from R_i ∩ B arrived up to 
time t_j. By Azuma's inequality, we know that this number 
is tightly concentrated around its average, i.e., for every 
ε_1 > 0 the probability that for a fixed t_j and fixed ball 
B this number is less than (W + j)(Γ_i - ε_1) is at most 
e^{-2Wε_1^2}. By using a more general version of 
Azuma's inequality, we can get rid of the fixed t_j and have 
the same upper bound for all t_j simultaneously. Consider 
the following event Q: for every ball B and for every 
j = 0, 1, ..., at least (W + j)(Γ_i - ε_1) data points 
from that ball have arrived up to time t_j. Thus, using the 
union bound over all the balls of the covering, we conclude 
that Q happens with probability at least 
1 - se^{-2Wε_1^2}. Now we analyze 

F conditioned on Q. For any fixed t_j, the average number 
of dominating points in the fixed ball B of the covering of 
R_i is at least h(Γ_i - ε_1). Besides, as previously, one can 
easily note that the actual number is tightly concentrated 
around the average. Taking δ = Γ_i - ε_1 - α and using 
Azuma's inequality once again, we derive an upper bound 
of the form e^{-2hδ^2} on the probability that a fixed ball 
B of the covering at some fixed time t_j contains fewer 
dominating points than we assumed above. Using this and 
taking the union bound over all t_j and all s balls of the 
covering we get: P(F^c | Q) ≤ sne^{-2hδ^2}, 

where X^c stands for the complement of an event X. Let 
H denote the event that among the first W points from 
R_i ∪ O there will be at most Wx outliers, where 
x = γ_i(1 + ε_2) 

and €2 > 0 is some fixed constant. From Chernoff’s in- 

^2 

equality we get: < e~ ^+'2 Now assume that 

H happens and that W points from R_i ∪ O have already 
arrived. Denote by I the following event: for all balls B 
of the covering of R_i, at any time t' no αh skeleton points 
in any ball are outliers. Note that if I does not hold, then 
after W points have been seen, at least αh - Wx new points 
from B that are outliers must be seen up to time t'. Fix 
again a ball B of the covering of R_i. Let us take the next 
t points coming from R_i ∪ O for t = αh - Wx, 
αh - Wx + 1, ... (we will take W in such a way that 
αh - Wx > 0). For a fixed t the probability that out of 
those t points there are more than (1 + ε_2)tγ_i outliers 
is, by Chernoff's bound, at 
_ 

most e ^+'2 Denote by If the following event; for 
all t ∈ {αh - Wx, αh - Wx + 1, ...} there are at most 
(1 + ε_2)tγ_i outliers out of those t points. By the union 

,2 

bound we have: P(J''^) < ne ^+'2 Assume that J 
holds and fix t. The probability that after t new points 
have been seen, one of the outliers will become a skeleton 
point in a fixed ball B based on the fixed random 
number that was assigned to it by the algorithm is at most 
(1 + ε_2)γ_i. The probability that this will be the case for 
at least αh - Wx outliers is at most 
((1 + ε_2)γ_i)^{αh - Wx}. 
Thus, if we take the union bound we can conclude that 

P(I'^) < n{e~ ^+'2 _|_ £2)7,)“^“'’'^®). Notice that 

8 i C IF^UH^U lU Thus we obtain P(£:i) < P(J'^) -f 
P('H°) 4- P(X'^). We conclude that P(£i) < 4- 

(1 — -(- g“_j_ n(e~ ^+'2 -|- 

s((l 4- £ 2 ) 71 )“^”'^^)- Thus, taking the union bound over 

all the cores we obtain: P(£’) < -4- ksne~^^^ 4- 

^2 

k{n 4- l)e~ ^+'2 -|- kns{{l 4- £ 2 ) 71 )“^”^^- If we fix: 

ei = a — ^ and W = then by bounding each 

term of the RHS expression in the last inequality with 
we obtain the lower bound on h as in the statement of the 
theorem. Since we have already noticed that it suffices to 
find an appropriate upper bound on P(E), this concludes 
the proof. □ 
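
The notion of domination used in this proof can be made concrete with a small sketch. It assumes that the h random numbers of each point are stored as one row of an array; the function names `dominates` and `dominating_points` are hypothetical and not taken from the algorithm's description.

```python
import numpy as np

def dominates(v_index, random_numbers):
    """random_numbers has shape (n_points, h); row i holds the h
    random numbers assigned to point i. Point v dominates the set
    if at least one of its numbers is smaller than the corresponding
    numbers of all other points."""
    others = np.delete(random_numbers, v_index, axis=0)
    if others.shape[0] == 0:
        return True  # vacuously true for a set containing only v
    return bool(np.any(random_numbers[v_index] < others.min(axis=0)))

def dominating_points(random_numbers):
    n = random_numbers.shape[0]
    return [i for i in range(n) if dominates(i, random_numbers)]
```

For the array `[[0.1, 0.9], [0.5, 0.2], [0.7, 0.8]]` the first two points dominate (each holds the minimum in one column) and the third does not. Since each of the h columns has a single minimum when there are no ties, at most h points can dominate a set, which is why the expected number of dominating points in a ball scales with h times the fraction of points that fall in that ball.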

6.2. Proof of Theorem 3.2 

We will use notation from the proof of Theorem 3.1. 

Proof. Fix some R_i. Note that we have already proved 
that the event F does not hold with probability at most 
se^{-2Wε_1^2} + sne^{-2hδ^2}. Note also that from 
the definition of F we know that if F holds then the number 
of clusters computed by the algorithm and containing 
at least one point from R_i will not increase after the time 
when the first W data points from the corresponding cluster 
C_i have been seen. Let us assume that F holds. Notice that 
by the time every ball of the covering gets at least one new 
data point, there will be just one cluster computed by the 
algorithm with points from the core R_i. When this is the 
case, no further errors regarding new points from the core 
R_i will be made. From Azuma's inequality and the union 
bound we immediately get: the number of extra data points 
coming from C_i that need to be seen to populate every ball 
of the covering with at least one of them is more than T 
with probability at most We conclude that with 

probability at most after 

T + W points of Ci have been already seen, the algorithm 
will still make mistakes on the new data points from TZi. If 
we now upper-bound every ingredient of the above sum by 
I and solve for W and T, then we get: W > ^ log(^), 
T > 2 ^ Taking the number of points h as in the 

previous theorem concludes the proof. □ 
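
The step that populates every ball of the covering can be explored numerically. The sketch below is a simplified Monte Carlo illustration, not the argument of the proof: it assumes disjoint cells whose hit probabilities sum to one (real covering balls may overlap), and it compares the empirical tail with the elementary union bound sum_b (1 - p_b)^T rather than with the Azuma-based bound used above. The names `points_until_covered` and `tail_estimate` are hypothetical.

```python
import numpy as np

def points_until_covered(cell_probs, rng, max_points=100_000):
    """Number of arrivals needed until every cell has been hit once."""
    cell_probs = np.asarray(cell_probs, dtype=float)
    hit = np.zeros(cell_probs.size, dtype=bool)
    for t in range(1, max_points + 1):
        hit[rng.choice(cell_probs.size, p=cell_probs)] = True
        if hit.all():
            return t
    raise RuntimeError("not all cells were hit within max_points")

def tail_estimate(cell_probs, T, trials=500, seed=0):
    """Empirical P(more than T points are needed) and the union bound."""
    rng = np.random.default_rng(seed)
    needed = np.array([points_until_covered(cell_probs, rng)
                       for _ in range(trials)])
    empirical = float(np.mean(needed > T))
    # Union bound: some cell is still empty after T arrivals.
    union_bound = float(np.sum((1.0 - np.asarray(cell_probs)) ** T))
    return empirical, union_bound
```

With four equally likely cells and T = 30, the union bound is 4 * 0.75^30, which is below 0.001, so the number of extra points needed grows only logarithmically in the number of cells and in the inverse failure probability.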




